{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a8ef9ad9-c7e1-4fed-b0bd-581064558089",
   "metadata": {},
   "source": [
    "# GPT-3.5-Turbo Performance on MMLU - Business Ethics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4f1c09a4-4859-469b-a156-dbb037c83a65",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import openai  # legacy pre-1.0 SDK: the cells below call openai.ChatCompletion\n",
    "import re\n",
    "import time\n",
    "import json\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "from tqdm import tqdm\n",
    "from datasets import load_dataset\n",
    "from tenacity import retry, stop_after_attempt, wait_chain, wait_fixed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eaea2a2a-2515-4508-9cdb-084d10853170",
   "metadata": {
    "tags": []
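   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Read the API key from the environment rather than hardcoding it in the notebook.\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"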
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "95ddd688-5bf5-40c5-a852-32e62f5a1bbb",
   "metadata": {
    "tags": []
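   },
   "outputs": [],
   "source": [
    "# Retry failed API calls with a stepped delay: 3 s before each of the first\n",
    "# three retries, 5 s before the next two, then 10 s before every later one.\n",
    "# No stop condition is set, so a failing call is retried indefinitely.\n",
    "@retry(wait=wait_chain(*[wait_fixed(3) for _ in range(3)] +\n",
    "                       [wait_fixed(5) for _ in range(2)] +\n",
    "                       [wait_fixed(10)]))\n",
    "def completion_with_backoff(**kwargs):\n",
    "    return openai.ChatCompletion.create(**kwargs)"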
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d4063503-c0e0-4df7-9866-0815f4c9fdf0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Load the chain-of-thought few-shot prompts, keyed by MMLU task name.\n",
    "with open('lib_prompt/mmlu-cot.json') as f:\n",
    "    mmlu_prompt = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "bbbbe828-fad8-48a9-9cab-4a7e675ecf9e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The following are multiple choice questions (with answers) about business ethics.\n",
      "\n",
      "Q: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .\n",
      "(A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts, Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain technology, Charitable donations (D) Boycotts, Buycotts, Digital technology, Increased Sales\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “In contrast to *boycotts*, *buycotts* aim to reward favourable behavior by companies. The success of such campaigns have been heightened through the use of *digital technology*, which allow campaigns to facilitate the company in achieving *increased sales*.” The answer is (D).\n",
      "\n",
      "Q: _______ is the direct attempt to formally or informally manage ethical issues or problems, through specific policies, practices and programmes.\n",
      "(A) Corporate social responsibility (B) Business ethics management (C) Sustainability (D) Environmental management\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The direct attempt manage ethical issues through specific policies, practices, and programs is business ethics management. The answer is (B).\n",
      "\n",
      "Q: Three contrasting tactics that CSO's can engage in to meet their aims are ________ which typically involves research and communication, ________, which may involve physically attacking a company's operations or ________, often involving some form of _______.\n",
      "(A) Non-violent direct action, Violent direct action, Indirect action, Boycott (B) Indirect action, Instrumental action, Non-violent direct action, Information campaign (C) Indirect action, Violent direct action, Non-violent direct-action Boycott (D) Non-violent direct action, Instrumental action, Indirect action, Information campaign\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “Three contrasting tactics that CSO's can engage in to meet their aims are *indirect action*, which typically involves research and communication, *violent direct action*, which may involve physically attacking a company's operations or *non-violent direct action*, often involving some form of *boycott*.” The answer is (C).\n",
      "\n",
      "Q: To ensure the independence of the non-executive board members, there are a number of steps which can be taken, which include non-executives being drawn from _______ the company, being appointed for a _________ time period as well as being appointed _________.\n",
      "(A) Outside, Limited, Independently (B) Inside, Limited, Intermittently (C) Outside, Unlimited, Intermittently (D) Inside, Unlimited, Independently\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “To ensure the independence of the non-executive board members, there are a number of steps which can be taken, which include non-executives being draw from *outside* the company, being appointed for a *limited* time period as well as being imported *independently*. The answer is (A).\n",
      "\n",
      "Q: Beyond the business case for engaging in CSR there are a number of moral arguments relating to: negative _______, the _______that corporations possess and the ________ of business and society.\n",
      "(A) Externalities, Power, Independence (B) Publicity, Insubstantial resources, Mutual dependence (C) Publicity, Power, Independence (D) Externalities, Power, Mutual dependence\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “Beyond the business case for engaging the CSR there are a number of moral arguments relating to: negative *externalities*, the *power* that corporations possess and the *mutual independence* of business and society. The answer is (D).\n"
     ]
    }
   ],
   "source": [
    "task = 'business_ethics'\n",
    "print(mmlu_prompt[task])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ce1c6b79-3530-4efe-bfd6-eead27d09ff3",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading and preparing dataset mmlu/business_ethics to /Users/yaofu/.cache/huggingface/datasets/lukaemon___mmlu/business_ethics/1.0.0/134145dc2582b9a08b42d1f4b828f84a0066e9cc2e7dd8c1d83bee475746ecc3...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "20fea07ed3d24a3aa09cab28f22d4579",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating test split:   0%|          | 0/99 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating validation split:   0%|          | 0/10 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split:   0%|          | 0/4 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset mmlu downloaded and prepared to /Users/yaofu/.cache/huggingface/datasets/lukaemon___mmlu/business_ethics/1.0.0/134145dc2582b9a08b42d1f4b828f84a0066e9cc2e7dd8c1d83bee475746ecc3. Subsequent calls will reuse this data.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ad6e0e70a43846df8f0c1a5bdc017a93",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "task_data = load_dataset(\"lukaemon/mmlu\", task)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0957327f-9454-4010-9501-b7086fbba125",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "99"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(task_data['test'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "5b18d727-13a7-45cd-a842-3462636d9c25",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': 'Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.',\n",
       " 'A': 'Unsafe practices, Wants, Fear, Trivial',\n",
       " 'B': 'Unsafe practices, Distress, Fear, Serious',\n",
       " 'C': 'Safe practices, Wants, Jealousy, Trivial',\n",
       " 'D': 'Safe practices, Distress, Jealousy, Serious',\n",
       " 'target': 'B'}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task_data['test'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "bec64ac3-96b4-45cd-b79e-6c8ad6fae234",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Build the query: few-shot prompt, then the first test question and its options.\n",
    "question = task_data['test'][0]\n",
    "prompt_q = mmlu_prompt[task] + \"\\n\\n\" + question['input'] + '\\n'\n",
    "for letter in ['A', 'B', 'C', 'D']:\n",
    "    prompt_q += '(' + letter + ') ' + question[letter] + ' '\n",
    "prompt_q += \"\\nA: Let's think step by step.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8847b4b5-99fe-47ab-941b-5bd02c43755e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The following are multiple choice questions (with answers) about business ethics.\n",
      "\n",
      "Q: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .\n",
      "(A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts, Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain technology, Charitable donations (D) Boycotts, Buycotts, Digital technology, Increased Sales\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “In contrast to *boycotts*, *buycotts* aim to reward favourable behavior by companies. The success of such campaigns have been heightened through the use of *digital technology*, which allow campaigns to facilitate the company in achieving *increased sales*.” The answer is (D).\n",
      "\n",
      "Q: _______ is the direct attempt to formally or informally manage ethical issues or problems, through specific policies, practices and programmes.\n",
      "(A) Corporate social responsibility (B) Business ethics management (C) Sustainability (D) Environmental management\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The direct attempt manage ethical issues through specific policies, practices, and programs is business ethics management. The answer is (B).\n",
      "\n",
      "Q: Three contrasting tactics that CSO's can engage in to meet their aims are ________ which typically involves research and communication, ________, which may involve physically attacking a company's operations or ________, often involving some form of _______.\n",
      "(A) Non-violent direct action, Violent direct action, Indirect action, Boycott (B) Indirect action, Instrumental action, Non-violent direct action, Information campaign (C) Indirect action, Violent direct action, Non-violent direct-action Boycott (D) Non-violent direct action, Instrumental action, Indirect action, Information campaign\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “Three contrasting tactics that CSO's can engage in to meet their aims are *indirect action*, which typically involves research and communication, *violent direct action*, which may involve physically attacking a company's operations or *non-violent direct action*, often involving some form of *boycott*.” The answer is (C).\n",
      "\n",
      "Q: To ensure the independence of the non-executive board members, there are a number of steps which can be taken, which include non-executives being drawn from _______ the company, being appointed for a _________ time period as well as being appointed _________.\n",
      "(A) Outside, Limited, Independently (B) Inside, Limited, Intermittently (C) Outside, Unlimited, Intermittently (D) Inside, Unlimited, Independently\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “To ensure the independence of the non-executive board members, there are a number of steps which can be taken, which include non-executives being draw from *outside* the company, being appointed for a *limited* time period as well as being imported *independently*. The answer is (A).\n",
      "\n",
      "Q: Beyond the business case for engaging in CSR there are a number of moral arguments relating to: negative _______, the _______that corporations possess and the ________ of business and society.\n",
      "(A) Externalities, Power, Independence (B) Publicity, Insubstantial resources, Mutual dependence (C) Publicity, Power, Independence (D) Externalities, Power, Mutual dependence\n",
      "A: Let's think step by step. We refer to Wikipedia articles on business ethics for help. The sentence that best uses the possible options above is “Beyond the business case for engaging the CSR there are a number of moral arguments relating to: negative *externalities*, the *power* that corporations possess and the *mutual independence* of business and society. The answer is (D).\n",
      "\n",
      "Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.\n",
      "(A) Unsafe practices, Wants, Fear, Trivial (B) Unsafe practices, Distress, Fear, Serious (C) Safe practices, Wants, Jealousy, Trivial (D) Safe practices, Distress, Jealousy, Serious \n",
      "A: Let's think step by step.\n"
     ]
    }
   ],
   "source": [
    "print(prompt_q)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c91e1e78-0e6d-4d82-9600-b6c76637fa80",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Go through the retry wrapper so a transient API error does not abort the run.\n",
    "response = completion_with_backoff(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"Follow the given examples and answer the question.\"},\n",
    "        {\"role\": \"user\", \"content\": prompt_q},\n",
    "    ],\n",
    "    temperature=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b1e7b952-30b3-43b3-aa11-06595a7f592c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We refer to Wikipedia articles on advertising regulation for help. The sentence that best uses the possible options above is “Typical advertising regulatory bodies suggest, for example that adverts must not: encourage *unsafe practices*, cause unnecessary *distress* or *fear*, and must not cause *serious* offence.” The answer is (B).\n"
     ]
    }
   ],
   "source": [
    "print(response['choices'][0]['message']['content'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "867945a0-6527-409e-801e-4244eaf75e61",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def test_answer_mmlu(pred_str, ans_str):\n",
    "    \"\"\"Return True if the predicted option letter matches the gold letter.\"\"\"\n",
    "    pattern = 'the answer is ('\n",
    "    pred = pred_str.lower().split(pattern)\n",
    "    # The gold letter is the first character after the 'A:\\n' marker.\n",
    "    gold = ans_str.split('A:\\n')[1][0].lower()\n",
    "\n",
    "    if len(pred) > 1:\n",
    "        # The predicted letter is the first character after the pattern.\n",
    "        pred = pred[1][0]\n",
    "    else:\n",
    "        # No parsable answer: fall back to guessing 'c' (lowercase, to match gold).\n",
    "        pred = 'c'\n",
    "    return pred == gold\n",
    "\n",
    "def parse_pred_ans(filename):\n",
    "    \"\"\"Parse a log of 'Q:' / 'A_model:' / 'A:' blocks, print the accuracy, and\n",
    "    return the questions, model answers and gold answers.\"\"\"\n",
    "    with open(filename) as fd:\n",
    "        lines = fd.readlines()\n",
    "    am, a = None, None\n",
    "    num_q, acc = 0, 0\n",
    "    current_mode = 'none'\n",
    "    questions = []\n",
    "    ans_pred = []\n",
    "    ans_gold = []\n",
    "    for l in lines:\n",
    "        if l.startswith('Q: '):\n",
    "            # A new question starts: score and store the previous one, if any.\n",
    "            if am is not None and a is not None:\n",
    "                questions.append(q)\n",
    "                ans_pred.append(am)\n",
    "                ans_gold.append(a)\n",
    "                if test_answer_mmlu(am, a):\n",
    "                    acc += 1\n",
    "            current_mode = 'q'\n",
    "            q = l\n",
    "            num_q += 1\n",
    "        elif l.startswith('A_model:'):\n",
    "            current_mode = 'am'\n",
    "            am = l\n",
    "        elif l.startswith('A:'):\n",
    "            current_mode = 'a'\n",
    "            a = l\n",
    "        else:\n",
    "            # Continuation line: append it to whichever block is currently open.\n",
    "            if current_mode == 'q': q += l\n",
    "            elif current_mode == 'am': am += l\n",
    "            elif current_mode == 'a': a += l\n",
    "            else:\n",
    "                raise ValueError(current_mode)\n",
    "\n",
    "    # Score and store the final question, which no later 'Q: ' line flushes.\n",
    "    questions.append(q)\n",
    "    ans_pred.append(am)\n",
    "    ans_gold.append(a)\n",
    "    if test_answer_mmlu(am, a):\n",
    "        acc += 1\n",
    "    print('num_q %d correct %d ratio %.4f' % (num_q, acc, acc / num_q))\n",
    "    return questions, ans_pred, ans_gold\n",
    "\n",
    "def test_finished(ans_model):\n",
    "    return 'answer is' in ans_model\n",
    "\n",
    "def extract_ans(ans_model):\n",
    "    ans_model = ans_model.split('\\n')\n",
    "    ans = []\n",
    "    residual = []\n",
    "    for li, al in enumerate(ans_model):\n",
    "        ans.append(al)\n",
    "        if('answer is' in al):\n",
    "            break\n",
    "    residual = list(ans_model[li + 1:])\n",
    "    ans = '\\n'.join(ans)\n",
    "    residual = '\\n'.join(residual)\n",
    "    return ans, residual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b398b88e-d8fd-47fc-be03-f4627f6719e1",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 62%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                   | 61/99 [01:24<00:55,  1.45s/it]\u001b[A\n",
      " 63%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                 | 62/99 [01:24<00:43,  1.18s/it]\u001b[A\n",
      " 64%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                                               | 63/99 [01:26<00:41,  1.16s/it]\u001b[A\n",
      " 65%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                             | 64/99 [01:27<00:45,  1.31s/it]\u001b[A\n",
      " 66%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                            | 65/99 [01:29<00:45,  1.34s/it]\u001b[A\n",
      " 67%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                          | 66/99 [01:30<00:43,  1.31s/it]\u001b[A\n",
      " 68%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                        | 67/99 [01:32<00:47,  1.50s/it]\u001b[A\n",
      " 69%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                      | 68/99 [01:33<00:45,  1.48s/it]\u001b[A\n",
      " 70%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                     | 69/99 [01:35<00:45,  1.51s/it]\u001b[A\n",
      " 71%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                   | 70/99 [01:35<00:36,  1.24s/it]\u001b[A\n",
      " 72%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                 | 71/99 [01:37<00:39,  1.41s/it]\u001b[A\n",
      " 73%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                               | 72/99 [01:39<00:40,  1.51s/it]\u001b[A\n",
      " 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                              | 73/99 [01:41<00:39,  1.51s/it]\u001b[A\n",
      " 75%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                            | 74/99 [01:42<00:36,  1.46s/it]\u001b[A\n",
      " 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                          | 75/99 [01:44<00:39,  1.65s/it]\u001b[A\n",
      " 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                        | 76/99 [01:46<00:38,  1.67s/it]\u001b[A\n",
      " 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                       | 77/99 [01:47<00:34,  1.57s/it]\u001b[A\n",
      " 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                     | 78/99 [01:48<00:28,  1.37s/it]\u001b[A\n",
      " 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                   | 79/99 [01:50<00:28,  1.44s/it]\u001b[A\n",
      " 81%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                 | 80/99 [01:51<00:26,  1.42s/it]\u001b[A\n",
      " 82%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                               | 81/99 [01:52<00:26,  1.45s/it]\u001b[A\n",
      " 83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                              | 82/99 [01:54<00:23,  1.37s/it]\u001b[A\n",
      " 84%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                            | 83/99 [01:55<00:23,  1.45s/it]\u001b[A\n",
      " 85%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                          | 84/99 [01:56<00:20,  1.37s/it]\u001b[A\n",
      " 86%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                        | 85/99 [01:58<00:19,  1.43s/it]\u001b[A\n",
      " 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                       | 86/99 [02:00<00:20,  1.61s/it]\u001b[A\n",
      " 88%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                     | 87/99 [02:01<00:18,  1.51s/it]\u001b[A\n",
      " 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                   | 88/99 [02:03<00:15,  1.45s/it]\u001b[A\n",
      " 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                 | 89/99 [02:04<00:13,  1.31s/it]\u001b[A\n",
      " 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                | 90/99 [02:04<00:10,  1.13s/it]\u001b[A\n",
      " 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊              | 91/99 [02:05<00:08,  1.03s/it]\u001b[A\n",
      " 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋            | 92/99 [02:06<00:06,  1.13it/s]\u001b[A\n",
      " 94%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍          | 93/99 [02:07<00:05,  1.01it/s]\u001b[A\n",
      " 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏        | 94/99 [02:07<00:04,  1.17it/s]\u001b[A\n",
      " 96%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉       | 95/99 [02:09<00:03,  1.04it/s]\u001b[A\n",
      " 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋     | 96/99 [02:10<00:03,  1.17s/it]\u001b[A\n",
      " 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍   | 97/99 [02:13<00:03,  1.54s/it]\u001b[A\n",
      " 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 98/99 [02:14<00:01,  1.62s/it]\u001b[A\n",
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99/99 [02:15<00:00,  1.37s/it]\u001b[A\n"
     ]
    }
   ],
   "source": [
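    "# Query the model on every test question and log the extracted answer next to the gold target.\n",
    "i = 0  # number of questions processed so far (only used by the debug break below)\n",
    "with open('outputs/test_gpt_3.5_turbo_%s.txt' % task, 'w') as fd:\n",
    "    for q_ in tqdm(task_data['test'], total=len(task_data['test'])):\n",
    "        # Format the question followed by its four lettered options.\n",
    "        q = q_['input'] + '\\n'\n",
    "        for letter in ['A', 'B', 'C', 'D']:\n",
    "            q += '(' + letter + ') ' + q_[letter] + ' '\n",
    "        q += \"\\nA: Let's think step by step.\"\n",
    "\n",
    "        # Prepend the few-shot chain-of-thought examples for this task.\n",
    "        prompt_q = mmlu_prompt[task] + \"\\n\\n\" + q\n",
    "\n",
    "        response = completion_with_backoff(\n",
    "            model=\"gpt-3.5-turbo\",\n",
    "            messages=[\n",
    "                {\"role\": \"system\", \"content\": \"Follow the given examples and answer the question.\"},\n",
    "                {\"role\": \"user\", \"content\": prompt_q},\n",
    "            ],\n",
    "            temperature=0,\n",
    "        )\n",
    "        ans_model = response['choices'][0]['message']['content']\n",
    "        ans_, residual = extract_ans(ans_model)\n",
    "\n",
    "        # Write the question, the model's extracted answer, and the gold answer.\n",
    "        a = q_['target']\n",
    "        fd.write('Q: %s\\nA_model:\\n%s\\nA:\\n%s\\n\\n' % (q, ans_, a))\n",
    "        i += 1\n",
    "        # Uncomment to stop after two questions for a quick smoke test.\n",
    "        # if i == 2: break"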
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "8ea5012b-444a-49d8-8752-51ea485d9beb",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "business_ethics\n",
      "num_q 99 correct 67 ratio 0.6768\n"
     ]
    }
   ],
   "source": [
    "print(task)\n",
    "_, _, _ = parse_pred_ans('outputs/test_gpt_3.5_turbo_%s.txt' % task)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
